Search results for "text indexes"

Showing 3 of 3 documents

Lightweight BWT Construction for Very Large String Collections

2011

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare thei…

Sequence; Theoretical computer science; Constant (computer programming); BWT; text indexes; Computer science; String (computer science); Search engine indexing; Process (computing); Context (language use); next-generation sequencing; Alphabet

researchProduct

Lightweight algorithms for constructing and inverting the BWT of string collections

2013

Recent progress in the field of DNA sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the BWT for 100 billion characters or more of data remains a computational challenge. In this work we ad…

Sequence; Theoretical computer science; Settore INF/01 - Informatica; General Computer Science; Computer science; String (computer science); Search engine indexing; Process (computing); Data_CODINGANDINFORMATIONTHEORY; Data structure; Field (computer science); BWT; Constant (computer programming); Text indexes; Next-generation sequencing; Alphabet; Algorithm; Auxiliary memory

researchProduct

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencing; Genomics (q-bio.GN); FOS: Computer and information sciences; Sequence; BWT; LCP; text indexes; next-generation sequencing datasets; massive datasets; Settore INF/01 - Informatica; Computer science; Computation; String (computer science); LCP array; Parallel computing; Data structure; DNA sequencing; Substring; FOS: Biological sciences; Computer Science - Data Structures and Algorithms; Quantitative Biology - Genomics; Data Structures and Algorithms (cs.DS)

researchProduct